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Abstract 

Complex, non-additive genetic interactions are common and can be critical in determining phenotypes. 
Genome- wide association studies (GWAS) and similar statistical studies of linkage data, however, assume 
additive models of gene interactions in looking for associations between genotype and phenotype. In 
general, these statistical methods view the compound effects of multiple genes on a phenotype as a 
sum of partial influences of each individual gene and can often miss a substantial part of the heritable 
effect . Such methods do not make use of any biological knowledge about underlying gcnotypc-phcnotypc 
mechanisms. Modeling approaches from the Artificial Intelligence field that incorporate deterministic 
knowledge into models while performing statistical analysis can be applied to include prior knowledge in 
genetic analysis. We chose to use the most general such approach, Markov Logic Networks (MLNs), that 
employs first-order logic as a framework for combining deterministic knowledge with statistical analysis. 
Using simple, logistic regression-type MLNs we have been able to replicate the results of traditional 
statistical methods. Moreover, we show that even with quite simple models we are able to go beyond 
finding independent markers linked to a phenotype by using joint inference that avoids an independence 
assumption. The method is applied to genetic data on yeast sporulation, a phenotype known to be 
governed by non-linear interactions between genes. In addition to detecting all of the previously identified 
loci associated with sporulation, our method is able to identify four additional loci with small effects on 
sporulation. Since their effect on sporulation is small, these four loci were not detected with standard 
statistical methods that do not account for dependence between markers due to gene interactions. We 
show how gene interactions can be detected using more complex models, which in turn can be used as 
a general framework for incorporating systems biology with genetics. Such future work that embodies 
systems knowledge in probabilistic models is proposed. 
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Author Summary 

We have taken up the challenge of devising a framework for the analysis of genetic data that is fully 
functional in the usual statistical correlation analysis used in genome-wide association studies, but also 
capable of incorporating prior knowledge about biological systems relevant to the genetic phenotypes. 
We develop a general genetic analysis approach that meets this challenge. We adapt an AI method for 
learning models, called Markov Logic Networks, that is based on the fusion of Markov Random Fields 
with first order logic. Our adaption of the Markov Logic Network method for genetics allows very complex 
constraints and a wide variety of model classes to be imposed on probabilistic, statistical analysis. We 
illustrate the use of the method by analyzing a data set based on sporulation efficiency from yeast, in 
which we demonstrate gene interactions and identify a number of new loci involved in determining the 
phenotype. 

Introduction 

Genome-wide association studies (GWAS) have allowed the detection of many genetic contributions to 
complex phenotypes in humans (see www.genome.gov). Studies of biological networks of different kinds, 
including genetic regulatory networks, protein-protein interaction networks and others, have made it clear, 
however, that gene interactions are abundant and are therefore of likely importance for genetic analysis [T]. 
Complex, non-additive interactions between genetic variations are very common and can potentially play 
a crucial role in determining phenotypes [5]-[5]. GWAS and similar statistical methods such as classical 
QTL studies generally assume additive models of gene interaction that attempt to capture a compound 
effect of multiple genes on a phenotype as a sum of partial influences of each individual gene [Bl[7] . These 
statistical methods also assume no biological knowledge about the underlying processes or phenotypes. 
Since biological networks are complex, and since variations are numerous, unconstrained searches for 
associations between genotype and phenotype require large population samples, and can succeed only 
in detecting a limited range of effects. Without imposing any constraints based on biological knowledge 
searching for gene interactions is very challenging, particularly when input data consist of different data 
types coming from various sources. 

The major question that motivated this work is " Can we constrain traditional statistical approaches 
by using biological knowledge to define some known networks that influence patterns in the data, and 
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can such approaches produce more complete genetic models?" For example, we might use the patterns 
present in the genotype data to build more predictive models based on both genotype and phenotype 
data. Note that the problem of using biological knowledge to constrain a model of genetic interaction is 
closely connected to the problem of integrating various types of data in a single model. In this article 
we employ a known Artificial Intelligence (AI) approach (Markov Logic Networks) to reformulate the 
problem of defining and finding genetic models in a general way and use it to facilitate detection of 
non-additive gene interactions. This approach allows us to lay the foundations for studies of essentially 
any kind of genetic model, which we demonstrate for a relatively simple model. 

Markov Logic Networks (MLNs) is one of the most general approaches to statistical relational learning, 
a sub-field of machine learning, that combines two kinds of modeling: probabilistic graphical models, 
namely Markov Random Fields, and first-order logic. Probabilistic graphical models, first proposed by 
Pearl [5], offer a way to represent joint probability distributions of sets of random variables in a compact 
fashion. A graphical structure describing the probabilistic independence relationships in these models 
allows the development of numerous algorithms for learning and inference and makes these models a 
good choice for handling uncertainty and noise in data. On the other hand, first-order logic allows us 
to represent and perform inferences over complex, relational domains. Propositional (Boolean) logic, 
which biologists are most familiar with, describes the truth state on the level of specific instances, while 
first-order logic allows us to make assertions about the truth state of relations between subsets (classes) of 
instances. Moreover, using first-order logic we can represent recursive and potentially infinite structures 
such as Markov chains where a temporal dependency of the current state on the state at the previous 
time step can be instantiated to an infinite time series. Thus, first order logic is a very flexible choice for 
representing general knowledge, like that we encounter in biology. 

MLNs merge probabilistic graphical models and first-order logic in a framework that gains the benefits 
of both representations. Most importantly, the logic component of MLNs provides an interface for adding 
biological knowledge to a model through a set of first-order constraints. At the same time, MLNs can 
be seen as a generalization of probabilistic graphical models since any distribution represented by the 
latter can be represented by the former, and this representation is more compact due to the first-order 
logic component. Even so, various learning and inference algorithms for probabilistic graphical models 
are applicable to MLNs and are thereby enhanced with logic inference. 

One key advantage of logic-based probabilistic modeling methods, and in particular MLNs, is that 
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they allow us to work easily with data that arc not independent and identically distributed (not i.i.d.). 
Many statistical and machine learning methods assume that the input data is i.i.d., a very strong, and 
usually artificial, property that most biological problems do not share. For instance, biological variables 
most often have a spatial or temporal structure, or can even be explicitly described in a relational database 
with multiple interacting relations. MLNs thus provide a means for non-i.i.d. learning and joint inference 
of a model. While the input data used in GWAS and in other genetic studies are rich in complex statistical 
intcrdependencies between the data points, MLNs can easily deal with any of these data structures. 

There are various modeling techniques that employ both probabilistic graphical models and first-order 
logic [9l417j . Many of them impose different restrictions on the underlying logical representation in order 
to be able to map the logical knowledge base to a graphical model. One common restriction employed, for 
example, in ^2lll3l[T5l[T6] is to use only clausal first-order formulas of the form b\ A 62 A ... A b n =$> h that 
are used to represent cause-effect relationships. The majority of the methods, such as those introduced 
in [TUl [TT], [T5] , use Bayesian networks, directed graphical models, as the probabilistic representation. 
However, there are a few approaches [16,17 that instead use Markov Random Fields, undirected graphical 
models, to perform inferences. 

We use Markov Logic Networks |T7] that merge unrestricted first-order logic with Markov Random 
Fields, and as a result use the most general probabilistic logic-based modeling approach. In this paper 
we show this MLN-based approach to understanding complex systems and data sets. Similar to [18] 
that proposed a Bayesian approach that can be used to infer models for QTL detection, our MLN- 
based approach is a model inference method that goes beyond just hypothesis testing. Moreover, in 
this paper we describe how we have adapted and applied MLN to genetic analysis so that complex 
biological knowledge can be included in the models. We have applied the method to a relatively simple 
genetic system and data set, the analysis of the genetics of sporulation efficiency in the budding yeast 
Saccharomyces cerevisiae. In this system, recently analyzed by Cohen and co-workers |19| , two genetically 
and phcnotypically diverse yeast strains, whose genomes were fully characterized, were crossed and the 
progeny studied for the genetics of the sporulation phenotypc. This provided a genetically complex 
phenotype with a well-defined genetic context to which to apply our method. 
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Methods 

Markov Random Fields 

Given a set of random variables of the same type, X = {Xi : 1 < i < N} and a set of possible values 
(alphabet) A = {Aj : 1 < i < M} so that any variable can take any value from A to Am (it is easy 
to extend this to the case of multiple variable types). Consider a graph, G, whose vertices represent 
variables, X, and whose edges represent probabilistic dependencies among the vertices such that a local 
Markov property is met. The local Markov property for a random variable Xi can be formally written 
as Pr(JQ = Aj | X \ {AA) = Pr(JQ = Aj \ N(Xi)) that states that a state of random variable Aj is 
conditionally independent of all other variables given A^'s neighbors N(Xi), N(Xi) C X\ {A^}. Let C 
denote the set of all cliques in G, where a clique is a subgraph that contains an edge for every pair of its 
nodes (a complete subgraph). Consider a configuration, 7, of X that assigns each variable, Xi, a value 
from A. We denote the space of all configurations as T = A x . A restriction of 7 to the variables of a 
specific clique G is denoted by jc- 

A Markov Random Field (MRF) is defined on X by a graph G and a set of potentials V = {Vc('Jc) '■ 
C € C, 7 G r} assigned to the cliques of the graph. Using cliques allows us to explicitly define the topology 
of models, making MRFs convenient to model long-range, higher-order connections between variables. 
We encode the relationships between the variables using the clique potentials. By the Hammersley- 
Clifford theorem, a joint probability distribution represented by an MRF is given by the following Gibbs 
distribution 

Pr (7) = | LI cxp(-Vfc(7c)), (1) 
cec 

where the so-called partition function Z = X) 7 er Ilcec cx p( — Vcilc)) normalizes the probability to 
ensure that X) 7 er P r (7) = 1- 

Without loss of generality we can represent a Markov Random Field as a log- linear model [50] : 



Pr( 7 ) = 




where fi : T — > M are functions defining features of the MRF and Wi G K are the weights of the MRF. 
Usually, the features are indicators of the presence or absence of some attribute, and hence are binary. 
For instance, we can consider a feature function that is 1 when some Xi has a particular value and 
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otherwise. Using these types of indicators, we can make M different features for Xi that can take on 
M different values. Given some configuration 7x\{Xi} of a U the variables X except Xi, we can have a 
different weight for this configuration whenever Xi has a different value. The weights for these features 
capture the affinity of the configuration jx.\{Xi} for each value of Xi. Note that the functions defining 
features can overlap in arbitrary ways providing representational flexibility. 

One simple mapping of a traditional MRF to a log-linear MRF is to use a single feature fa for each 
configuration 7c of every clique C with the weight w; = — Vc(7c)- Even though in this representation the 
number of features (the number of configurations) increases exponentially as the size of cliques increases, 
the Markov Logic Networks described in the next section attempt to reduce the number of features 
involved in the model specification by using logical functions of the cliques' configurations. 

Given an MRF, a general problem is to find a configuration 7 that maximizes the probability Pr(7). 
Since the space T is very large, performing an exhaustive search is intractable. For many applications, 
there are two kinds of information available: prior knowledge about the constraints imposed on the 
simultaneous configuration of connected variables; and observations about these variables for a partic- 
ular instance of the problem. The constraints constitute the model of the world and reflect statistical 
dependencies between values of the neighbors captured in an MRF. For example, when modeling gene 
association with phenotype, the restrictions on the likelihood of configurations of co-expressed genes may 
be cast as an MRF with cliques of size 2 and 3 (see figure [T]). In the next section, we give a biological 
example involving construction of MRFs with cliques of size 3 and 4, and provide more mathematical 
details. 




Figure 1. An example of a Markov Random Field. The MRF represents four genes influencing a 
phenotype with potentials Vi , . . . , Va (blue edges) . This model restricts genetic interactions to two 
pair- wise interactions with potentials V5 , V§ (red cliques) . 
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Markov Logic Networks: Mapping First-Order Logic to Markov Random 
Fields 

Markov Logic Networks merge Markov Random Fields with first-order logic. In first-order logic (FOL) 
we distinguish constants and variables that represent objects and their classes in a domain, as well 
as functions specifying mappings between subgroups of objects, and predicates representing relations 
among objects or their attributes. We call a predicate ground if all its variables are assigned specific 
values. To illustrate, consider a study of gene interactions through the phenotypic comparison of wild 
type strains, single mutants, and double mutants such as the one presented in [3]. Consider a set of 
constants representing genes, {gl,g2}, gene interaction labels, {A,B}, and difference between phenotype 
values of mutants, {0, 1, 2}. We define the following predicates: RelWS/2 (a 2-argument predicate which 
captures a relation between a wild type and a single mutant), RelWD/3 (a relation between a wild type 
and a double mutant), RelSS/3 (a relation between two single mutants), RelSD/4 (a relation between 
a single mutant and a double mutant), Int/3 (an interaction between two genes). Using FOL we can 
define a knowledge base consisting of two formulas: 

Vx,ye {gl,g2},Vce {0,1}, Vv,ue {0,1,2}, Int(x,y,c)^ (RelWS(x,v)^RelSD(y,x,y,u)) 

Vx,y e {gl,g2},Vc e {0, l},Vv,u,we {0, l,2},RelWS(x,v) ARelWS(y,u) A RelWD(x, y, w) => Int(x,y, c). 

The first rule represents the knowledge that depending on the type of interaction between two genes, 
there is a dependency between RelWS(x,v) and RelSD(y, x, y, u) relations. The second rule captures the 
knowledge that three relations, RelWS(x, v), RelWS(y,u), and RelWD(x, y, w), together determine the type 
of gene interaction. 

Note that first-order formulas define relations between (potentially infinite) groups of objects or their 
attributes. Formulas in FOL can be seen as relational templates for constructing models in preposi- 
tional logic. Therefore, FOL offers a compact way of representing and aggregating relational data. For 
example, two first-order formulas above can be replaced with 288 prepositional formulas since variables 
x, y, c can be assigned 2 different values and variables u, v, w can be assigned 3 values. Moreover, us- 
ing representational power of FOL, we can specify infinite structures such as temporal relations, e.g., 
Expression(el, tl) A NextTimeStep(tl, t2) => Expression(e2, t2), that can give rise to a theoretically 
infinite number of propositions. 
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The principal limitation of any strictly formal logic system is that it is not suitable for real applications 
where data contain uncertainty and noise. For example, the formulas specified earlier hold for the real 
data most of the time, but not always. If there is at least one data point where a formula does not hold, 
then the entire model is disregarded as being false. The two allowed states, true or false, is equivalent 
to allowing only probability values 1 or 0. Markov Logic Networks, however, relax this constraint by 
allowing the model with unsatisfied formula with a lesser probability than one. The model with the 
smallest number of unsatisfied formulas will then be the most probable. 

Markov Logic Networks (MLNs) extend FOL by assigning a weight to each formula indicating its 
probabilistic strength. An MLN is a collection of first-order formulas Fi with associated weights Wi . For 
each variable of a Markov Logic Network there is a finite set of constants representing the domain of the 
variable. A Markov Logic Network together with its corresponding constants is mapped to a Markov 
Random Field as follows. 

Given a set of all predicates on an MLN, every ground predicate of the MLN corresponds to one 
random variable of a Markov Random Field whose value is 1 if the ground predicate is true and 
otherwise. Similarly, every ground formula of Fi corresponds to one feature of the log-linear Markov 
Random Field whose value is 1 if the ground formula is true and otherwise. The weight of the feature 
in the Markov Random Field is the weight Wi associated with the formula Fi in the Markov Logic Network. 

From the original definitions (fl) and © and the fact that features of false ground formulas are equal 
to 0, the probability distribution represented by a ground Markov Logic Network is given by 

Pr (7) = ^cxp (^Wirni-y)] = — ]J CX P , (3) 

where ^(7) is a number of true ground formulas of the formula Fi in the state 7 (which directly cor- 
responds to our data), 7, is the configuration (state) of the ground predicates appearing in Fi. Vi is a 
potential function assigned to a clique which corresponds to Fi, and exp(— ^(7,;)) = exp(iUj). Note that 
this probability distribution would change if we changed the original set of constants. Thus, one can 
view MLNs as templates specifying classes of Markov Random Fields, just like FOL templates specifying 
propositional formulas. 

Figure [2] illustrates a portion of a Markov Random Field corresponding to the ground MLN. We 
assume a set of constants and an MLN specified by the knowledge base from the example above, where 
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a weight is assigned to each formula. 




Figure 2. An example of a subnetwork of a Markov Random Field unfolded from a Markov 
Logic Network program. Each node of the MRF corresponds to a ground predicate (a predicate 
with variables substituted with constants). The nodes of all ground predicates that appear in a single 
formula form a clique such as the one highlighted with red. The blue triangular cliques correspond to 
the first formula of the MLN and are assigned the weight of the formula (2.1). The larger rectangular 
cliques, such as the one colored red, correspond to the second formula of the MLN with the weight 1.5. 

An Example: Application to Yeast Sporulation Dataset 

We applied our method to a data set generated by Cohen and co-workers [19]. They generated and 
characterized a set of 374 progeny of a cross between two yeast strains that differ widely in their efficiency 
of sporulation (a wine and an oak strain). For each of the progeny the sporulation efficiency was measured 
and assigned a normalized real value between and 1. To generate a discrete value set we then binned 
and mapped the sporulation efficiencies into 5 integer values. Each yeast progeny strain was genotyped at 
225 markers that are uniformly distributed along the genome. Each marker takes on one of two possible 
values indicating whether it derived from the oak or wine parent genotype. 

Using Markov Logic Networks, we first model the effect of a single marker on the phenotype, i.e., 
sporulation efficiency. Define a logistic-regression type model with the following set of formulas: 

Vstrain € {1, . . . , 374}, G(strain, m, g) E(strain, v), w m g v , (4) 

for every marker m under consideration (at this point we consider one marker in a model), genotype value 
g G {A, B}, and phenotype value v S {!,..., 5}. This Markov Logic Network contains two predicates, G 
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and E. Predicate G denotes markers' genotype values across yeast crosses, e. g., G(strain, Ml, B) captures 
all yeast crosses for which the genotype of a marker Ml is B. Similarly, predicate E denotes the phenotype 
(sporulation efficiency) across yeast crosses, for instance, E(strain, 1) captures all yeast strains for which 
the level of sporulation efficiency is 1. The Markov Logic Network ((4]) contains 10 formulas, 1 marker 
of interest times 2 possible genotype values times 5 possible phenotype values. Each formula represents 
a pattern that holds true across all yeast crosses (indicated by the variable strain) with the same 
strength (indicated by the weight w m g!V ). In other words, the weight w m g . v represents the fitness of the 
corresponding formula across all strains. 

Instantiations of the predicate G represent a set of predictor variables, whereas instantiations of the 
predicate E represent a set of target variables ((4]). There are 748 ground predicates of G (assuming we 
handle only one marker in a model) and 1870 ground predicates of E. Each ground predicate corresponds 
to a random variable in the corresponding Markov Random Field (see the previous section for more 
details). Since our MLN contains 10 formulas and there are 374 possible instantiations for each formula, 
the corresponding log-linear Markov Random Field contains 3740 features, one for each instantiation of 
every formula. 

Learning the Weights of MLNs 

Each data point in the original dataset corresponds to one ground predicate (cither E or G in our example) . 
For example, the information that a genotype value of a marker M71 in a strain S13 is equal to A 
corresponds to a ground predicate G(S13, M71, A). Therefore, the original dataset can be represented 
with a collection of ground predicates that logically hold, which in turn is described as a data vector 
d = (d\, , . , , (In), where N is the number of all possible ground predicates (N — 2618 in this example). 
An element di of the vector d is equal to 1, if the «th ground predicate (assuming some order) is true 
and thus is included in our collection, and otherwise. Note that this vector representation is possible 
under a closed world assumption stating that all ground predicates that are not listed in our collection 
are assumed to be false. 

In order to carry out training of a Markov Logic Network we can use standard Newtonian methods 
for likelihood maximization. The learning proceeds by iteratively improving weights of the model. At 
the jth step, given weights w^), we compute V w (3)L(w^)), the gradient of the likelihood, which is our 
objective function that we maximize. Consequently, we improve the weights by moving in the direction 
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of the positive gradient, w« +1 l = w^' + aV w o*) L(w^'), where a is the step size. 
Recall that the likelihood is given by 



L(w 7 = Pr( 7 w) = - exp ^^Wl) = = -= -— . 

z \i J £y e r ex P (Ei ^^(Yj) 



(5) 



where 7 is a state (also called a configuration) of the set of random variables X and T is a space of all 
possible states. Therefore, the log-likelihood is 



logi(w I 7) = logPr(7 I w) = ^2 w i n i{l) - lo S 



Now derive the gradient with respect to the network weights, 



d 



- — logi(w I 7) = 7^(7) 



dw 



E yer [ZPr(Y)] dw 3 



exp [^2 w * n ^') ) 

7'er V i / 

H ex p \ ) 

7'er V i / 



(0) 



Ey er [^Pr(Y)]i 



Ml 51 



,ygi- l v , ,j 7 , er 

= n 3 -( 7 )-X; N(7')i(w| 7 ')]- 
7'er 



7^(7') exp y ^2,w i n i {i) 



(7) 



Note that the sum is computed over all possible variable states 7'. The above expression shows that 
each component of the gradient is a difference between the number of true instances a corresponding 
formula Fj (the number of true ground formulas of Fj) and the expected number of true instances of Fj 
according to the current model. However, computation of both components of the difference is intractably 
large. 

Since the exact number of true ground formulas cannot be tractably computed from data [17] . the 
number is approximated by sampling the instances of the formula and checking their truth values accord- 
ing to the data. 

On the other hand, it is also intractable to compute the expected number of true ground formulas 
as well as the log-likelihood L(w | 7'). The former involves inference over the model, whereas the later 
requires computing the partition function Z = E 7 'gr ex P (Ej w i n i(l'))- One solution, proposed in |17) . 
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is to maximize the pseudo-likelihood 



N 



L ( w I 7) = n Pr (^' I 7mb 3 ;w), 



(8) 



where 7,- is a restriction of the state 7 to the jth ground predicate and 7m is a restriction of 7 to what 
is called a Markov blanket of the jth ground predicate (the state of the Markov blanket according to 
our data). We elected to use this approach. Similar to the original definition of the Markov blanket in 
the context of Bayesian networks [8] , the Markov blanket of a ground predicate is a set of other ground 
predicates that are present in some ground formula. Using the yeast sporulation example, the set of 
ground predicates {Vm, Vg | G(Sl,m, g)} is the Markov blanket of E(S1, 1) due to the knowledge base 
Maximization of pseudo-likelihood is computationally more efficient than maximization of likelihood, 
since it does not involve inference over the model, and thus does not require marginalization over a large 
number of variables. Currently, we use the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) 
algorithm from the Alchemy implementation of MLNs |21j to optimize the pscudo- likelihood. 

Using MLNs for Querying 

After learning is complete, we have a trained Markov Logic Network that can be used for various types 
of inference. In particular, we can answer such queries as "what is the probability that a ground predicate 
Q is true given that every predicate from a set (a conjunction) of ground predicates Ev = {E\, . . . , E m } 
is true ?" The ground predicate Q is called a query and the set Ev is called an evidence. Answering this 
query is similar to computing the probability Pr(Q | E\ A ... A E m , MLN). Using the product rule for 
probabilities we get 



where Tp is the set of all possible configurations where a ground predicate P is true, and Te = Te 1 H 

...nr Ero . 

Computing the above probabilistic query for majority of real application problems, which include 
the computational problems posed by the complexity experienced in systems biology, is intractable. 



Pr(Q I £1 A ... A E m , MLN) 
E- ( er Q nr E Pr(7 I MLN) 
S 7e r E Pr (7 I MLN) ' 



Pr(Q A Ex A ... A E m \ MLN) 
Pr(£d A . . . A E m I MLN) 



(9) 
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Therefore, we need to approximate Pr(Q | Ei A ... A E m , MLN), which can be done using various 
sampling-based algorithms. 

Markov Logic Networks adopt a knowledge-based model construction approach consisting of two steps: 
1. constructing the smallest Markov Random Field from the original MLN that is sufficient for computing 
the probability of the query, and 2. inferring the Markov Random Field using traditional approaches. 
One of the commonly used inference algorithms is Gibbs sampling where at each step we sample a ground 
predicate Xj given its Markov blanket. In order to define the probability of the node Xj being in the 
state 7j given the state of its Markov blanket we use the earlier notation. 

Given a jth ground predicate Xj, all the formulas containing Xj are denoted by Fj. We denote the 
Markov blanket of Xj as MBj and a restriction of 7 to the Markov blanket (the state of the Markov 
blanket) as 7^ s . . Similarly, a restriction of the state 7 to Xj is denoted as "fj . Recall that each formula 
F of a Markov Logic Network corresponds to a feature of a Markov Random Field, where the feature's 
value is the truth value / of the formula F depending on states 71 , . . . , 7& of the ground predicates 
X\,...,Xk constituting the formula and denoted by / = F| 71i ... )7J .. Note that .F| 71i ... i7i . can also be 
shown as F| 7liWB . Using this notation we can express the probability of the node Xj to be in the state 
7j when its Markov blanket is in the state ^mba as 



For the Gibbs sampling, we let a Markov chain converge and then estimate the probability of a 
conjunction of ground predicates to be true by counting the fraction of samples from the estimated 
distribution in which all the ground predicates hold. The Markov chain is run multiple times in order to 
handle situations when the distribution has multiple local maxima so that the Markov chain can avoid 
being trapped on one of the peaks. One of the current implementations, called Alchemy |21j . attempts 
to reduce the burn-in time of the Gibbs sampler by applying a local search algorithm for the weighted 
satisfiability problem, called MaxWalkSat }2"2"] . 




(10) 
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Components of the Computational Method 

The general overview of the computational method is given in figure [3] At the first step, the method 
traverses the set of all markers and assigns an error score to each marker indicating its predictive power. 
An error score of a marker corresponds to performance of an MLN ^ based on this single marker (a 
random variable m of Q that essentially chooses the location of markers to use in the model has only 
one value the location of this single marker). The algorithm selects markers whose error scores are 
considerably lower than the average: we selected the outliers that are 3 standard deviations below the 
mean. 

The current version of the method greedily selects a marker with the lowest score and appends it to a 
list of fixed markers (which is initially empty) . The algorithm then repeats the search for an outlier with 
the lowest score, but this time a marker's score is computed using an MLN (|4]) that estimates a joint 
effect of the marker under consideration together with all of the currently fixed markers on a phenotype 
(the variable m takes on the locations of all fixed markers and the marker under consideration) . At each 
iteration the algorithms expands the list of fixed markers and rescans all of the remaining markers for 
outliers. The method stops as soon as no more outliers are detected and returns the fixed markers as 
potential loci associated with the phenotype. 

Our scanning for predictive genetic markers can be seen as an instance of the variable selection 
problem. We use cross-validation to compare probabilistic models and select the one with the best fit to 
the data and the smallest number of informative markers. Using cross-validation we assess how a model 
generalizes to an independent dataset, addressing the model overfitting problem and facilitating unbiased 
outlier detection. Particularly, we use if -fold cross-validation which is illustrated in figure EI A) . The 
data set is arbitrarily split into K folds, Di, . . . ,T)r, and K iterations are performed. The iih iteration 
of cross-validation selects Uj^i as a training dataset and as a testing dataset. The model is then 
trained on the training dataset and the performance of the model is assessed using the testing dataset 
resulting in a prediction error. The average of prediction errors from K steps, called a cross-validation 
error, is used as a score of the model. In case of yeast sporulation efficiency dataset introduced earlier, 
we used 11-fold cross-validation (since a population of 374 yeast strains can be evenly partitioned into 11 
subsets with 34 strains in each). The results were generally insensitive to the cross-validation parameters. 

Recall that we distinguish two types of variables, the target variables whose values are predicted, 
and the predictor variables whose values are used to predict the values of the target variables. In the 
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Figure 3. Components of the computational method. This figure illustrates three major 
computational components. We used cross-validation to estimate the goodness of fit of a model. Panel 
A depicts 4-fold cross-validation. The data is partitioned into four sets and at each iteration a model is 
trained on three sets and then tested on the fourth set resulting in four prediction error scores that are 
then averaged for a total cross-validation error. Panel B shows details of model training and testing. 
Using training data we estimate the weights of each formula of an MLN template. The trained MLN is 
evaluated on testing data resulting in an error score. We shuffle the order in the training data and the 
order of the testing data and repeat the training and testing. The error scores after each evaluation are 
averaged to produce the average error score of the MLN. Panel C shows how the components illustrated 
in panels A and B are combined to search for the most informative markers. At the iVth iteration of 
the search, given a data set and set of N — 1 fixed markers, the method traverses the set of all markers 
and evaluates models constructed using the fixed markers and a selected marker. Note that at the iVth 
iteration, iV-marker models consider the fixed markers (N — 1) as possible interactors with the next 
marker we found. The method then selects the outlier model, adds the corresponding marker to the set 
of fixed markers, and repeats the traversal of the markers. The method stops when no outliers are found. 

example above, sporulation efficiency of a yeast strain is a target variable, whereas genotype markers are 
predictor variables. Note that in some cases we can treat variables as both targets and predictors (e.g. 
gene expression in eQTL datasets) . During the evaluation phase in the ith iteration of cross-validation 
we consider the testing dataset D^. Using knowledge-based model construction approach, we build a 
Markov Random Field that is small yet sufficient to infer the values of all target variables in the set 
Dj. The target variables are inferred based on the values of the predictor variables from (see section 
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"Using MLNs for Querying"). 

The model prediction of a target variable X that can take on any value from {x\, X2, £3} can be 
represented as a vector v = (pi,P2,P3), where pj = Pr(X = Xj | Dj,0Mi?,F) is the probability of X to 
take on a value Xj given the testing data Di and the Markov Random Field with parameters Qmrf- On 
the other hand, the actual value of X (provided in the testing dataset D^) can be represented as a vector 
v = (ui, U2, K3}, where Vj 7^ k,Vj = and Vk = 1 iff X = Xk in D^. Then the prediction error should 
measure the difference between the prediction v and the true value v. We used the Euclidean distance 
d(y, v) to compute the prediction error. This approach might make the comparison to other approaches 
difficult since the error can be a value that is not bounded by and 1 , but by and VM, where M is the 
size of the domain of X (3 in our example). Further computation is required to obtain values for model 
accuracy, to explain variance, and other standard characteristics. On the other hand, Euclidean distance 
is certainly sufficient for comparing predictions of different models. 

Due to the approximate nature of learning and inference in MLNs (see sections "Learning the Weights 
of MLNs" and "Using MLNs for Querying"), two structurally identical models, trained on two data sets 
that differ only in the order of the samples, can generate predictions with slight differences. This is 
due to the fundamental path-dependency of learning and inference algorithms in knowledge-based model 
construction. For example, the order of training data affects the order in which the Markov Random 
Field is built, which in turn affects the way the approximate reasoning is performed over the field. Path- 
dependency introduces artificial noise into predictions and considerably reduces our ability to distinguish 
a signal with a small magnitude (such as a possible minor effect of a genetic locus on a phenotype) from 
a background. 

In order to reduce the effect of path-dependency on overall model prediction we shuffle the input data 
set and average the resulting predictions. We employed an iterative approach, based on shuffling, for 
denoising. At each iteration the model is retrained and reevaluated on newly shuffled data and the running 
mean of the model prediction is computed (see figure [3JB)). The method incrementally computes the 
prediction average until achieving convergence, namely until the difference between the running average 
at the two consecutive iterations is smaller than Th for W consecutive steps. The parameters Th and 
W arc directly connected with the total amount of shuffling and re-estimation performed as illustrated 
in figure |U 

In order to perform rigorous denoising, we select a lower value for the threshold Th (0.0001) and a 
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Figure 4. The amount of shuffling depends on the threshold and the size of the window. 

Note that the number of iterations until convergence of the algorithm increases when the threshold 
decreases or the window size increases. Note also that selecting tight stopping parameters tends to 
allow the algorithm to identify more informative markers. 



larger window size W (10). The shuffling-based denoising procedure is applied at each iteration of the 
cross-validation. Averaging of the predictions after data shuffling reduces the amount of artificial noise 
enabling the overall method for detection of genetic loci to distinguish markers with a smaller effect on 
the phenotype (the algorithm detects more informative markers as illustrated in figure [4]). 

There are many different strategics to search for the most informative subset of genetic markers. In 
this section we used a greedy approach in order to illustrate the general MLN-based modeling framework 
presented in this paper. In the next section we show that MLN-bascd modeling that accounts for depen- 
dencies between markers through joint-inference allows us to find interesting biological results by using 
a greedy search method. In order to be confident that the fixed markers are meaningful, we manually 
selected markers at each iteration of the search from the set of outliers and arrived at a similar set of 
candidate loci (within the same local region). 



Results 

The analyses presented in this paper are based on the dataset from |19| containing both phenotype 
(sporulation efficiency) and genotype information of yeast strains derived from one intercross. The results 
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are obtained using our method that searches for the largest set of genetic markers with the strongest 
compound effect on the phenotype. All the detailed information on the computational components of 
our method is presented in the Methods section. 

In their paper Gerke et al. [19] identified 5 markers that have an effect on sporulation efficiency 
including 2 markers whose effect seems to be very small. Moreover, Gerke et al. provided evidence for 
non-linear interactions between 3 major loci. The presence of confirmed markers with various effect and 
non-linear interactions make the dataset from |19| an ideal choice for testing our computational method. 

Our method allows us to define and to use essentially any genetic model. First we used a simple 
regression-type model that mimics simple statisical approaches, like GWAS. At the first stage, the method 
compares the markers according to their individual predictive power. The amount of the effect of a marker 
on the phenotype is estimated by computing a prediction error score from a regression model based solely 
on this marker. The top line on the left panel in figure 5 illustrates the error scores of all markers ordered 
by location (X axis). In figure [5] we observe three loci (around markers 71, 117, and 160) with the 
strongest effect on sporulation efficiency, which were identified in |19j . 

At the next stage, our method adds the marker with the strongest effect (marker 71) to all the 
following models. This allows us to compare all other markers according to their predictive power in 
conjunction with the fixed marker 71. This time the prediction error score of a marker indicates how 
well this marker together with the marker 71 predicts the sporulation efficiency. The score is, thus, 
computed from the regression model based on two markers: the marker 71 and a second marker. This is 
an important distinction from traditional GWAS where the searches for multiple influential markers are 
performed independently from each other. In our approach, using MLNs and in particular joint-inference, 
the compound effect of markers is estimated allowing us to see possible interactions between markers. 

The method continues the iterations and selects 11 markers before the error no longer improves 
sufficiently and the computation stops. Among the selected markers, 5 are the same loci previously 
identified in [TO] (markers 71, 117, 160, 123, 79), 3 are the markers next to the loci with the strongest 
effect (markers 72, 116, 161), and 3 are the new markers that have not been reported before (markers 57, 
14, 20). In addition, the method identifies another marker (marker 130) as a candidate for a locus that 
has an effect on sporulation efficiency, although this marker was not selected due to its weak predictive 
power. Notice that even with a relatively simple model, such as logistic regression, and a quite stringent 
criterion for outliers (3 standard deviations from the mean, a p-valuc 0.003 for a normal distribution) we 



19 




Markers (IDs) 



Figure 5. Prediction of sporulation efficiency using a subset of markers. This figure shows 
the execution of the algorithm plotted as error scores vs markers for a number of models. The top line 
in the left plot shows error scores of the models based only on a single marker plotted on the X-axis. A 
green star indicates the outlier marker with the smallest error and purple stars depict other outliers 
(markers whose error differs from the mean error by 3 standard deviations). The second line shows the 
error scores of the models based on a corresponding marker together with the previously selected 
marker (indicated with the green star). All the following lines are interpreted in the similar way. The 
left plot shows five markers with a large effect. The rest of the identified markers (six markers with a 
small effect) arc illustrated in the right plot. 

are able to exceed the number of identified candidate loci. We argue that our method is more efficient 
at discovering markers with a very low individual effect on phenotype that have non-trivial interactions 
with other sporulation-affecting loci due to the use of joint-inference of MLNs. 

There are several distinct properties of our method that are important to note. First, although 
the method selects the neighboring markers of the three strongest loci, it does not select a neighbor 
immediately after the original loci has been identified, because there arc better markers to be found. For 
example, after selecting the first marker 71, the method finds markers 117 and 160, and only then selects 
marker 72, which is the neighbor of 71. The method selects the next strongest marker at each stage that 
maximally increases the compound effect of selected markers. Second, our method docs not find markers 
that do not add sufficient predictive power. The criterion for outliers determines when the method stops 
and determines the confidence that the added markers have a real effect on the phenotype. 
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For each new marker (57, 14, 20, 130) we examined all genes that were nearby (different actual 
distances were used). For example for the marker 14 we considered all genes located between markers 
13 and 15. Table Q] shows genes located near these newly identified markers that are involved in either 
meiosis or in sporulation. 

The simple logistic regression-type model that was used can be summarized using the first-order 
formula G(strain, m, g) =>■ E(strain, v), which captures the effect of a subset of markers on phenotypc. 
In order to investigate gene-gene interactions we used a pair-wise model which can be summarized by the 
formula G(strain, ml, gl) A G(strain, m2, g2) E(strain, v). The pair-wise model subsumes the simple 
regression model, since whenever ml and m2 are identical, the pair-wise MLN is mapped to the same set 
of cliques as those from the simple MLN. However, the pair-wise model defines the dependencies between 
two loci and a phenotype that are mapped to an additional set of 3-node cliques. The pair-wise model 
allows us explicitly to account for the pair-wise gene interactions. When using the pair-wise model, the 
joint-inference is performed over an MLN where possible interactions between two markers are specified 
with first-order formulas. 

The assumption inherent in genome-wide analyses is that a simple additive effect can be observed 
when applying the pair- wise model to loci that do not interact: the compound effect is essentially a sum of 
the individual effects of each locus. On the other hand, for two interacting markers, the pair-wise model 
is expected to predict a larger-than-additive compound effect. Since the pair-wise model incorporates 
possible interactions, the prediction error of this model should be smaller than the error of a simple model 
by a factor that corresponds to how much the interaction information helps to improve the prediction. 

By using the pair-wise model, we investigated the presence of interactions between markers 71, 117, 
160 which correspond to loci with the strongest effect on sporulation efficiency. We denote the prediction 
error of a simple regression model based on markers Mi, . . . , M„ as errors (Mi, . . . , M„), and the error 
of a pair- wise model based on markers Mi, . . . , M„ as error py/{M\, . . . , M„). Figure [5] compares the 
prediction errors of the simple regression model based on one marker (red line) and the errors of the pair- 
wise model based on two markers one of which is preset to 71 (blue line). Note that the baseline prediction 
error of the pair- wise model is the same as errors (71), which means that on average the choice of a second 
marker in the pair-wise model does not affect the prediction. There are, however, 2 markers that visibly 
improve the prediction, namely markers 117 and 160. Note that the prediction error error pyy (71, 160) 
(the right blue peak) is lower than errors (160) (the rightmost red peak) by only a value roughly equal to 
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Figure 6. Investigating the 71-117 loci interaction. This figure compares a standard 
genome-wide scan made by a simple regression model based on a single marker (red line) and a scan 
made by a pair- wise model based on two markers, one of which is preset to 71 (blue line). The green 
lines represent the size of the leftmost red peak corresponding to the difference d between the baseline 
prediction error of the simple model and the error errors (71). Pink bars represent how much the 
difference between errors (117) and error pw (71, 117) is larger than d. The large size of the leftmost 
pink bar indicates a strong interaction between markers 71 and 117. 



the difference between the average prediction errors of simple and pair-wise models (this value is equal to 
the size of the leftmost red peak). The reduction of the prediction error, when combining markers 71 and 
160, is additive, suggesting that there is no interaction between these two markers. On the other hand, if 
we look at the effect of combining markers 71 and 117, we can see that the prediction improvement using 
the pair-wise model based on both markers (the size of the left blue peak) is considerably more than just 
a sum of prediction improvements of two simple models independently (the leftmost and the middle red 
peaks). The non-additive improvement suggests that there is an interaction between the markers 71 and 
117. 

Figures [JJ and [5] show a similar analysis to that illustrated in figure [5] performed on the other two 
markers. The analysis shown in figure [7] confirms the interactions between markers 71 and 117 and, 
additionally, suggests that there is an interaction between markers 117 and 160. Figure [5] confirms the 
interaction between 117 and 160 and the absence of the interaction between 71 and 160. One can see from 
figure[5]that the leftmost blue peak indicating errorpvy(71, 160) is a sum of errors(71) and errors(l60) 
(the pink bar next to the left pink star is extremely short). On the other hand, the rightmost blue peak 
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Figure 7. Investigation of 117-71 and 117-160 loci interactions. This figure compares a 
standard genome-wide scan using a single-marker model and a scan using a pair-wise model based on 
two markers, one of which is preset to 117. Sec the caption of figure[6]for more details. Both pink bars 
indicate the presence of non-additive interactions between markers 117 and 71 and markers 117 and 160. 

is a lot more than just a sum of individual errors (the pink bar is tall). In fact, errorpw (117, 160) is 
almost the same as error pw (71, 160). 

The two predicted interactions, 71-117 and 117-160, were experimentally identified in (19,. The 
strength of these interactions is significant enough to immediately stand out during the analysis in 
figurc[6] We next applied this analysis to the set of all nine identified loci (71, 117, 160, 123, 57, 14, 130, 
79, 20) in order to quantify possible interactions between every pair of markers. For each two markers A 
and B from the set of loci we compute the prediction errors of a simple model based solely on either A 
or B, denoted as errors (A) and error s(B). We also compute the prediction error of a pair- wise model 
based on both A and B, denoted as error p\y(A,B). Consequently the size of a possible interaction 
between A and B, denoted as i(A, B), is estimated using the following expression: 

i(A, B) = d(A, B) - d(A) - d(B) 

— [median — errorp\v(A,B)) — (median — error s(A)) — (median — errors(B)) (H) 
= error s(A) + error s(B) — error pw (A, B) — median. 

Here median is a baseline of prediction error of a simple model based on a single marker. We averaged 
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Figure 8. Investigating an interaction between markers 117 and 160. This figure compares a 
standard genome-wide scan using a single-marker model and a scan using a pair-wise model based on 
two markers, one of which is preset to 160. See the caption of figure [6] for more details. Note 
errorpw(Jl, 160) that is almost the same as errorpw (117, 160) even though errors(71) is considerably 
lower than errors (117). The tall pink bar on the right side of the figure indicates a non-additive 
interaction between markers 117 and 160. On the other hand, the pink bar on the left is almost 
non-existent indicating absence of an interaction between markers 71 and 160. 

the errors over 10 independently computed iterations. We next determined how high the value i(A, B) 
should be in order to confidently predict an interaction between markers A and B. We selected 36 pairs 
of randomly selected markers that were not from the set of nine informative markers and computed 
i(A, B) for each pair. Since we do not expect any interactions between random, non-informative markers, 
their i(A, B) values are used to estimate a confidence interval for no interaction. We computed a mean 
and standard deviation of the set of i(A, B) values corresponding to the randomly chosen markers. It is 
estimated that a value of 2.54 standard deviations away from the mean completely covers the set of i(A, B) 
values for all random markers, and we therefore argue there is a strong likelihood of interaction between 
markers A and B whenever i(A, B) is more than 3 standard deviations away from the mean. Whenever 
i(A, B) is less than 3 but more than 2.54 standard deviations away from the mean, we argue there is a 
probable interaction between A and B. Estimated interactions between identified loci are illustrated as a 
network of marker interactions in figure [9] where the color of each link represents the level of confidence of 
the corresponding interaction. We repeated the estimation of interactions by randomly selecting another 
36 pairs of markers and computing the confidence intervals for the new set (2.32 standard deviations). 
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The probable interactions identified from the first experiment were confirmed in the second experiment. 
Two possible interactions, however, were identified in one experiment but not the other and are depicted 
in figure |H] with dashed links. This is a result of a slightly shifted mean of the second set of 36 random 
marker pairs relative to the first set, and the marginal size of the effect. Two interactions with very 
large i(A, B) values, 71-117 and 117-160, were previously identified in QjJ]. We also found several smaller 
interactions illustrated in figure [9] that have not been identified before. Note that since we measure 
absolute values i(A,B), it is not a surprise that interactions 71-117 and 117-160 are so large, since the 
corresponding loci (71, 117, 160) are by far the strongest. Locus 117, which is involved in the strongest 
interactions, corresponds to the gene IME1 |19j . the master regulator of meiosis (www.yeastgenome.com) . 
Since IME1 is a very important sporulation gene, it is entirely reasonable that this gene is central to our 
interaction network (figure [9]). 
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Figure 9. Estimated network of gene-gene interactions. This figure shows a network of 
estimated interactions between loci based on genetic data and sporulation phenotype. The color of the 
links corresponds to the level of confidence of interactions. The three darkest colors are associated with 
the most likely interactions. The fourth color is associated with an interaction 123-117 that scored high 
in the first experiment (more than 3 standard deviations away from the mean), but lower in the second 
experiment (2.5 standard deviations), although still above the confidence level (2.32 standard 
deviations). Two possible interactions depicted with the dashed lines with the lightest color scored 
above the confidence level in the first experiment (2.54 standard deviations), but below the confidence 
level in the second experiment (1 standard deviation). Loci 160, 71, 117, 123, 79 were previously 
identified in [19]. Moreover, the genes of the three major loci were detected: IME1 (locus 117), RME1 
(locus 71), and RSF1 (locus 160) [15]. The two strongest interactions, 160-117 and 117-71, were also 
identified in [19]. 
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Discussion 

The method presented in this paper provides a framework for using virtually any genetic model in a 
genome-wide study because of the high representational power of MLNs. This power stems from the use 
of general, first-order logic conjoined to probabilistic reasoning. Moreover, the use of knowledge-based 
model approaches that build models based on both data and a relevant set of first order formulas [23] 
allows us to efficiently incorporate prior biological knowledge into a genetic study. The generality of 
MLNs allows greater representational power than most modeling approaches. The general approach can 
be viewed as a seamless unification of statistical analysis, model learning and hypothesis testing. 

As opposed to standard genome-wide approaches to genetics which assume additivity, the aim of the 
method described in this article is not to return values corresponding to the strength of individual effects 
of each marker. Our method aims at discovering the loci that are involved in determining the phenotypc. 
The method computes the error scores for each marker in the context of the others representing the 
strength of each marker's effect in combination with other markers. It will be valuable in future to derive 
a scoring technique for markers that can be used directly to compare with results of the traditional 
approaches. In general, our approach provides a way of searching for the best model predicting the 
phenotypc from the genetic loci. Since the model and the corresponding joint-inference methodology 
incorporate the relations between the model variables, we are able to begin a quantitative exploration of 
possible interactions between genetic loci. 

Our method shows promise in that it can accommodate complex models with internal relationships 
among the variables specified. The development of a succinct and clear language and grammar based on 
FOL for the description of (probabilistic) biological systems knowledge will be critical for the widespread 
application of this method to genetic analyses. Achieving this goal will also represent a significant step 
toward the fundamental integration of systems biology and the analysis and modeling of networks with 
genetics. Additionally, the development of the biological language describing useful biological constraints 
can alleviate the computational burden associated with model inference. MLN-based methods, such as 
ours, that perform both logic and probabilistic inferences are computationally expensive. While increases 
in computing power steadily reduce the magnitude of this problem, there are other approaches that 
will be necessary. Given such a focused biological language, we could tailor the learning and inference 
algorithms specifically to our needs and thus reduce the overall computational complexity of the method. 
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Another future direction will be to find fruitful connections between a previously developed information 
theory-based approach to gene interactions [23] with this Al-derived approach. There are clear several 
other applications of this approach in the field of biology. It is clear that there are, for example, many 
similarities between the problem discussed here and the problems of data integration. 

Biomarker identification, particularly for panels of biomarkers, is another important problem that 
involves many challenges of data integration and that can benefit from our MLN-bascd approach. Similar 
to GWAS or QTL mapping, where we search for genetic loci that are linked with a phenotype of interest, 
in biomarker detection we search for proteins or miRNAs that are associated with a disease state in a set 
of patient data. Just as in genetics, we can represent biomarkers as a network because there are various 
underlying biological mechanisms that govern the development of a disease. Often the most informative 
markers from a medical point of view have weak signals by themselves. MLNs can allow us to incorporate 
partial knowledge about the underlying biological processes to account for the inter-dependencics making 
the detection of the informative biomarkers more effective. It is clear that the approach described here 
has the potential to integrate the biomarker problem with human genetics, a key problem in the future 
development of personalized medicine. 
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Table 1. Candidate genes near the new informative markers. 



Marker 


Euclidean 
Error 


Coord. 


Candidate 
Genes 


GO 


Description 


57 


0.7061 


6, 

103743 


YFL039C, 
ACTl 
YFL037W, 
TUB2 

YFL033C, 
RIM15 

YFL029C, 
CAKl 

YFL009W, 
CDC4 

YFL005W, 
SEC4 


S 
M 

M 

M 
M 

S 


Actin, structural protein involved in cell polariza- 
tion, endocytosis, and other cytoskeletal functions 
Beta-tubulin; associates with alpha-tubulin 
(Tublp and Tub3p) to form tubulin dimer, which 
polymerizes to form microtubules 
Glucose-repressiblc protein kinase involved in sig- 
nal transduction during cell proliferation in re- 
sponse to nutrients, specifically the establishment 
of stationary phase; identified as a regulator of 
IME2; substrate of Pho80p-Pho85p kinase 
Cyclin-dcpcndcnt kinasc-activating kinase required 
for passage through the cell cycle, phosphorylates 
and activates Cdc28p 

F-box protein required for Gl/S and G2/M transi- 
tion, associates with Skplp and Cdc53p to form a 
complex, SCFCdc4, which acts as ubiquitin-protein 
ligase directing ubiquitination of the phosphory- 
lated CDK inhibitor Siclp 

Secretory vesicle-associated Rab GTPase essential 
for exocytosis 


14 


0.7009 


2, 

656824 


YBR180W, 
DTR1 

YBR186W, 
PCH2 


S 
M 


Putative dityrosine transporter, required for spore 
wall synthesis; expressed during sporulation; mem- 
ber of the major facilitator superfamily (DHAl 
family) of multidrug resistance transporters 
Nucleolar component of the pachytene checkpoint, 
which prevents chromosome segregation when re- 
combination and chromosome synapsis are defec- 
tive; also represses meiotic interhomolog recombi- 
nation in the rDNA 


130 


0.7010 


11, 

447373 


YKR029C, 
SET3 

YKR031C, 
SP014 


M 
S 


Defining member of the SET3 histonc dcacetylase 
complex which is a meiosis-specific repressor of 
sporulation genes; necessary for efficient transcrip- 
tion by RNAPII 

Phospholipase D, catalyzes the hydrolysis of phos- 
phatidylcholine, producing choline and phospha- 
tidic acid; involved in Secl4p-indcpendcnt secre- 
tion; required for meiosis and spore formation; dif- 
ferently regulated in secretion and meiosis 


20 


0.6972 


3, 

188782 


YCR033W, 
SNT1 


M 


Subunit of the Set3C deacetylase complex that in- 
teracts directly with the Set3C subunit, Sif2p; pu- 
tative DNA-binding protein 



The list of genes located near the new markers identified by the MLN-based method. The table shows 
only the genes that are involved in sporulation or meiosis. Specific information for the genes can be 
found at www.yeastgenome.org. 



